Change from concurrent.futures to multiprocessing #354
Open
donglihe-hub wants to merge 1 commit into ASUS-AICS:master
Conversation
Force-pushed from 5c58aa6 to 206f6f5
Force-pushed from 7941e41 to 763d476
Force-pushed from 763d476 to 271fe7a
What does this PR do?
Change from a higher-level interface (concurrent.futures) to a lower-level one (multiprocessing) because:

When many tasks are submitted to a concurrent.futures.ProcessPoolExecutor pool, there is a chance that deadlocks will occur with CPython (see python/cpython#105829). The same example run with multiprocessing.pool.Pool had no such problem.

Trying to find out the best num_processes
I tested tokenization with various num_processes (using RegexTokenization):
[Benchmark results: linux, fork]
Add-on: I re-ran the code. This time, 16 had the best performance in all cases.
Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the lowest running time among all the settings.
Having said that, the results are device- and system-specific. This means the best choice for num_processes might be different, for example, on an Intel CPU or on Windows (I'm using an AMD server CPU and Linux).
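Put together, the change amounts to something like the sketch below. The tokenizer and sample data are illustrative assumptions (the real code uses LibMultiLabel's RegexTokenization), with num_processes=16 per the benchmark above:

```python
import re
from multiprocessing import Pool


def regex_tokenize(text):
    # Illustrative stand-in for LibMultiLabel's RegexTokenization:
    # split the text into word tokens.
    return re.findall(r"\w+", text)


def tokenize_corpus(texts, num_processes=16):
    # multiprocessing.pool.Pool replaces concurrent.futures.ProcessPoolExecutor,
    # avoiding the deadlock reported in python/cpython#105829 when many
    # tasks are submitted.
    with Pool(processes=num_processes) as pool:
        return pool.map(regex_tokenize, texts)


if __name__ == "__main__":
    print(tokenize_corpus(["Hello, world!", "LibMultiLabel test"]))
```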
I also tested multiprocessing on Windows. Since Windows and macOS don't have "fork" as a start method, runs take longer using the "spawn" start method (spawn takes more time to start than fork).
[Benchmark results: win32, spawn]
I also tested spawn on Linux:
[Benchmark results: linux, spawn]
It turned out that supporting multiprocessing is more complicated than I thought, so I'll limit the use of multiprocessing to Linux only.
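Restricting multiprocessing to Linux could be sketched as below. The function names and the sequential fallback are illustrative assumptions, not the actual implementation:

```python
import sys
from multiprocessing import get_context


def tokenize(text):
    # Placeholder for the real tokenization step.
    return text.split()


def tokenize_all(texts, num_processes=16):
    if sys.platform == "linux":
        # "fork" is cheap and available on Linux, so parallelize there.
        with get_context("fork").Pool(processes=num_processes) as pool:
            return pool.map(tokenize, texts)
    # Windows/macOS would need "spawn", whose startup cost erases the
    # gains here, so fall back to sequential tokenization.
    return [tokenize(t) for t in texts]
```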
Test CLI & API (bash tests/autotest.sh): test the APIs used by main.py.
Check API Document: if any new APIs are added, please check that the description of the APIs is added to the API document.
Test quickstart & API (bash tests/docs/test_changed_document.sh): if any APIs in quickstarts or tutorials are modified, please run this test to check that the current examples still run correctly after the modified APIs are released.